Method of mapping restriction sites in polynucleotides

ABSTRACT

The invention provides a method for constructing a high resolution physical map of a polynucleotide. In accordance with the invention, the polynucleotide is digested successively with at least two different restriction endonucleases and the ends of the restriction fragments are sequenced after each digestion. In this manner, restriction fragments having sequenced ends are produced that can be aligned by their sequences to give a physical map of the polynucleotide. Preferably, restriction fragment ends are sequenced by massively parallel signature sequencing (MPSS), or a like parallel sequencing technique.

This is a continuation-in-part of U.S. patent application Ser. No. 08/884,189 filed Jun. 27, 1997, now abandoned, which is incorporated herein by reference.

FIELD OF THE INVENTION

The invention relates generally to methods for constructing physical maps of genomic DNA, and more particularly, to a method of providing high resolution physical maps using a parallel DNA sequencing technology, such as massively parallel signature sequencing (MPSS).

BACKGROUND

Physical maps of one or more large pieces of DNA, such as a genome or chromosome, consist of an ordered collection of molecular landmarks that may be used to position, or map, a smaller fragment, such as a clone containing a gene of interest, within the larger structure, e.g. U.S. Department of Energy, “Primer on Molecular Genetics,” from Human Genome 1991-92 Program Report; and Los Alamos Science, 20: 112-122 (1992). An important goal of the Human Genome Project has been to provide a series of genetic and physical maps of the human genome with increasing resolution, i.e. with reduced distances in basepairs between molecular landmarks, e.g. Murray et al, Science, 265: 2049-2054 (1994); Hudson et al, Science, 270: 1945-1954 (1995); Schuler et al, Science, 274: 540-546 (1996); and so on. Such maps have great value not only in furthering our understanding of genome organization, but also as tools for helping to fill contig gaps in large-scale sequencing projects and as tools for helping to isolate disease-related genes in positional cloning projects, e.g. Rowen et al, pages 167-174, in Adams et al, editors, Automated DNA Sequencing and Analysis (Academic Press, New York, 1994); Collins, Nature Genetics, 9: 347-350 (1995); Rossiter and Caskey, Annals of Surgical Oncology, 2: 14-25 (1995); and Schuler et al (cited above). In both cases, the ability to rapidly construct high-resolution physical maps of large pieces of genomic DNA is highly desirable.

Two important approaches to genomic mapping include the identification and use of sequence tagged sites (STS's), e.g. Olson et al, Science, 245: 1434-1435 (1989); and Green et al, PCR Methods and Applications, 1: 77-90 (1991), and the construction and use of jumping and linking libraries, e.g. Collins et al, Proc. Natl. Acad. Sci., 81: 6812-6816 (1984); and Poustka and Lehrach, Trends in Genetics, 2: 174-179 (1986). The former approach makes maps highly portable and convenient, as maps consist of ordered collections of nucleotide sequences that allow application without having to acquire scarce or specialized reagents and libraries. The latter approach provides a systematic means for identifying molecular landmarks spanning large genetic distances and for ordering such landmarks via hybridization assays with members of a linking library.

Unfortunately, these approaches to mapping genomic DNA are difficult and laborious to implement. It would be highly desirable if there were an approach for constructing physical maps that combined the systematic quality of the jumping and linking libraries with the convenience and portability of the STS approach.

SUMMARY OF THE INVENTION

Accordingly, an object of my invention is to provide a method for constructing high resolution physical maps of genomic DNA.

Another object of my invention is to provide a method of mapping genomic DNA by massively parallel signature sequencing of restriction fragments of the genomic DNA.

Another object of my invention is to provide a method of ordering restriction fragments by aligning matching sequences of their ends.

A further object of my invention is to provide physical maps of genomic DNA that consist of an ordered collection of nucleotide sequences spaced at an average distance of a few kilobases or less.

My invention achieves these and other objects by providing a method for constructing a physical map of a polynucleotide. In accordance with the invention, a polynucleotide is digested successively with at least two different restriction endonucleases and the ends of the restriction fragments are sequenced after each digestion. In this manner, restriction fragments having sequenced ends are produced that can be aligned by their sequences to give a physical map of the polynucleotide. Preferably, restriction fragment ends are sequenced by massively parallel signature sequencing (MPSS), or a like parallel sequencing technique.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1A-1D graphically illustrate the concept of the invention.

FIG. 2 illustrates the effect of the occurrence of multiple restriction recognition sites of a second restriction endonuclease between two consecutive restriction recognition sites of a first restriction endonuclease.

FIG. 3 is a schematic representation of a flow chamber and detection apparatus for observing a planar array of microparticles loaded with restriction fragments for sequencing.

FIG. 4 illustrates an embodiment of the invention which employs partial methylation to generate cleavable fragments for sequencing.

DEFINITIONS

As used herein, the term “ligation” means the formation of a covalent bond between the ends of one or more (usually two) oligonucleotides. The term usually refers to the formation of a phosphodiester bond resulting from the following reaction, which is usually catalyzed by a ligase:

oligo₁(5′)-OP(O—)(═O)O+HO-(3′)oligo₂-5′→oligo₁(5′)-OP(O—)(═O)O-(3′)oligo₂-5′

where oligo₁ and oligo₂ are either two different oligonucleotides or different ends of the same oligonucleotide. The term encompasses non-enzymatic formation of phosphodiester bonds, as well as the formation of non-phosphodiester covalent bonds between the ends of oligonucleotides, such as phosphorothioate bonds, disulfide bonds, and the like. A ligation reaction is usually template driven, in that the ends of oligo₁ and oligo₂ are brought into juxtaposition by specific hybridization to a template strand. A special case of template-driven ligation is the ligation of two double stranded oligonucleotides having complementary protruding strands.

“Complement” or “tag complement” as used herein in reference to oligonucleotide tags refers to an oligonucleotide to which an oligonucleotide tag specifically hybridizes to form a perfectly matched duplex or triplex. In embodiments where specific hybridization results in a triplex, the oligonucleotide tag may be selected to be either double stranded or single stranded. Thus, where triplexes are formed, the term “complement” is meant to encompass either a double stranded complement of a single stranded oligonucleotide tag or a single stranded complement of a double stranded oligonucleotide tag.

The term “oligonucleotide” as used herein includes linear oligomers of natural or modified monomers or linkages, including deoxyribonucleosides, ribonucleosides, anomeric forms thereof, peptide nucleic acids (PNAs), and the like, capable of specifically binding to a target polynucleotide by way of a regular pattern of monomer-to-monomer interactions, such as Watson-Crick type of base pairing, base stacking, Hoogsteen or reverse Hoogsteen types of base pairing, or the like. Usually monomers are linked by phosphodiester bonds or analogs thereof to form oligonucleotides ranging in size from a few monomeric units, e.g. 3-4, to several tens of monomeric units, e.g. 40-60. Whenever an oligonucleotide is represented by a sequence of letters, such as “ATGCCTG,” it will be understood that the nucleotides are in 5′→3′ order from left to right and that “A” denotes deoxyadenosine, “C” denotes deoxycytidine, “G” denotes deoxyguanosine, and “T” denotes thymidine, unless otherwise noted. Usually oligonucleotides of the invention comprise the four natural nucleotides; however, they may also comprise non-natural nucleotide analogs. It is clear to those skilled in the art when oligonucleotides having natural or non-natural nucleotides may be employed, e.g. where processing by enzymes is called for, usually oligonucleotides consisting of natural nucleotides are required.

“Perfectly matched” in reference to a duplex means that the poly- or oligonucleotide strands making up the duplex form a double stranded structure with one another such that every nucleotide in each strand undergoes Watson-Crick basepairing with a nucleotide in the other strand. The term also comprehends the pairing of nucleoside analogs, such as deoxyinosine, nucleosides with 2-aminopurine bases, and the like, that may be employed. In reference to a triplex, the term means that the triplex consists of a perfectly matched duplex and a third strand in which every nucleotide undergoes Hoogsteen or reverse Hoogsteen association with a basepair of the perfectly matched duplex. Conversely, a “mismatch” in a duplex between a tag and an oligonucleotide means that a pair or triplet of nucleotides in the duplex or triplex fails to undergo Watson-Crick and/or Hoogsteen and/or reverse Hoogsteen bonding.

As used herein, “nucleoside” includes the natural nucleosides, including 2′-deoxy and 2′-hydroxyl forms, e.g. as described in Kornberg and Baker, DNA Replication, 2nd Ed. (Freeman, San Francisco, 1992). “Analogs” in reference to nucleosides includes synthetic nucleosides having modified base moieties and/or modified sugar moieties, e.g. described by Scheit, Nucleotide Analogs (John Wiley, New York, 1980); Uhlman and Peyman, Chemical Reviews, 90: 543-584 (1990), or the like, with the only proviso that they are capable of specific hybridization. Such analogs include synthetic nucleosides designed to enhance binding properties, reduce complexity, increase specificity, and the like.

As used herein “sequence determination” or “determining a nucleotide sequence” in reference to polynucleotides includes determination of partial as well as full sequence information of the polynucleotide. That is, the term includes sequence comparisons, fingerprinting, and like levels of information about a target polynucleotide, as well as the express identification and ordering of nucleosides, usually each nucleoside, in a target polynucleotide. The term also includes the determination of the identification, ordering, and locations of one, two, or three of the four types of nucleotides within a target polynucleotide. For example, in some embodiments sequence determination may be effected by identifying the ordering and locations of a single type of nucleotide, e.g. cytosines, within the target polynucleotide “CATCGC . . . ” so that its sequence is represented as a binary code, e.g. “100101 . . . ” for “C-(not C)-(not C)-C-(not C)-C . . . ” and the like.
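The binary representation mentioned above can be illustrated with a minimal Python sketch. The function name `binary_signature` is hypothetical and is introduced here only for illustration; it simply marks each position as 1 where the chosen nucleotide occurs and 0 elsewhere.

```python
def binary_signature(sequence, nucleotide="C"):
    """Encode a sequence as 1 where `nucleotide` occurs and 0 elsewhere.

    Illustrative helper only; the name and signature are assumptions.
    """
    return "".join("1" if base == nucleotide else "0" for base in sequence.upper())


# The example from the text: cytosines within "CATCGC"
print(binary_signature("CATCGC"))  # prints 100101
```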

As used herein, the term “complexity” in reference to a population of polynucleotides means the number of different species of polynucleotide present in the population.

DETAILED DESCRIPTION OF THE INVENTION

In accordance with the present invention, nucleotide sequences at the ends of restriction fragments are used to order the fragments into a physical map. Preferably, a target polynucleotide is digested with at least two different restriction endonucleases (or at least one restriction endonuclease and its cognate methylase), after which the ends of the resulting fragments are sequenced.

The concept of the invention may be illustrated by considering the ideal situation of FIG. 1A, where polynucleotide (10) has recognition sites (r₁, r₂, r₃, r₄, and r₅) for restriction endonuclease r and recognition sites (e₁ through e₅) for restriction endonuclease e, such that the sites of the two restriction endonucleases alternate. That is, between any two consecutive sites for r there is exactly one site for e, and between any two consecutive sites for e there is exactly one site for r. A sample of polynucleotide (10) is digested with r (11) to produce fragments (12), each of the fragments having a single recognition site (14) for e. As illustrated in FIG. 1C and described in more detail below, each fragment, e.g. (49), is preferably anchored by an end (51) to a solid phase support (50), after which the nucleotide sequence (57) of the free end (52) is determined. Of course, any nucleotide sequencing method can be employed, but as explained more fully below, the most useful application of the invention can be made when a technique is employed that permits many thousands of fragments to be sequenced at the same time.

Once the sequence of each free end (52) is obtained, the fragments are digested (54) with e, and the nucleotide sequence (61) of new free end (56) is determined. In the preferred embodiment, this process is carried out on each fragment in both orientations with respect to which end is anchored to the solid phase support as a result of the sequencing approach employed, although such “double” sequencing is not necessary for the invention. It is merely a consequence of the use of MPSS to determine the sequences. That is, separately, fragment (49) is anchored by end (52) and the nucleotide sequence (58) of free end (51) is determined, after which fragment (49) is digested with e to produce new free end (59). The nucleotide sequence (62) of the new free end (59) is then determined. The locations (64) of the sequence elements (51), (57), (61), and (62) are summarized at the bottom of FIG. 1C.

A consequence of the “over-determination” of sequence information by MPSS is that two independent physical maps are produced simultaneously. Generally, one map consists of the sequences on one side of each restriction cleavage, and the other map consists of the sequences on the other side of each restriction cleavage.

Returning to FIGS. 1A and 1B, the locations of the ordered pairs of sequences (18) of the fragments (12) and their relative positions are illustrated. However, the ordered pairs are not linked. Linking information is obtained by digesting another sample of polynucleotide (10) with e (20) to form fragments (22), each of which has a single recognition site (25) for r. Ordered pairs of sequences (26) are obtained, after processing fragments (22) as fragments (12) were processed, with the exception that the second digestion is with r. If ordered pairs (26) are combined with ordered pairs (18), as shown in FIG. 1D, a physical map (30) is obtained.
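The way ordered pairs from the two digests link into a single map can be sketched in Python for the ideal alternating case. This is a simplified illustration under stated assumptions, not the procedure of the figures: each ordered pair is assumed to be already oriented as (upstream sequence, downstream sequence), every end sequence is assumed unique, and the function name `assemble_map` and all example sequences are hypothetical.

```python
def assemble_map(ordered_pairs):
    """Chain (upstream, downstream) end-sequence pairs into one ordered list.

    Assumes every sequence is unique and the pairs form a single linear chain.
    """
    next_of = dict(ordered_pairs)
    if len(next_of) != len(ordered_pairs):
        raise ValueError("an upstream sequence occurs in more than one pair")
    downstream = set(next_of.values())
    starts = [seq for seq in next_of if seq not in downstream]
    if len(starts) != 1:
        raise ValueError("pairs do not form a single unambiguous chain")
    order = [starts[0]]
    while order[-1] in next_of:
        order.append(next_of[order[-1]])
    return order


# Hypothetical 9-nucleotide end sequences adjacent to sites e1, r2, e2, r3, e3
e1, r2, e2, r3, e3 = "ACGTTGCAT", "GGATCCTAA", "TTGACCGTA", "CATGGTCAA", "AGGCTTACG"

pairs_from_r_digest = [(e1, r2), (e2, r3)]  # unlinked pairs from the r digest
pairs_from_e_digest = [(r2, e2), (r3, e3)]  # linking pairs from the e digest

physical_map = assemble_map(pairs_from_r_digest + pairs_from_e_digest)
print(physical_map == [e1, r2, e2, r3, e3])  # prints True
```

Neither digest alone yields a chain in this sketch; the pairs from one digest share a sequence with the pairs from the other, which is what links them.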

As illustrated by polynucleotide (10′) of FIG. 2, when multiple recognition sites (200), e.g. e₆, e₇, e₈, and e₂, of one of the restriction endonucleases occur between two consecutive recognition sites of the other restriction endonuclease, some fragments (202) will not give rise to ordered pairs of sequences. Such sequences can simply be ignored when ordered pairs are assembled into a physical map.

In many cases, a pattern of recognition sites of two or more restriction endonucleases may be converted into the alternating pattern of FIGS. 1A-1D by constructing jumping and linking libraries. This is especially the case where at least one restriction endonuclease is a “rare cutter” and the rest are “frequent cutters,” e.g. as for a restriction endonuclease with a 6- or 8-basepair recognition sequence and those with a 4-basepair recognition sequence, respectively. Jumping and linking libraries also allow sequence analysis of shorter fragments when one of the restriction endonucleases gives rise to unmanageably large fragments with respect to the sequencing technique employed. Preferably, for MPSS the fragments should be less than a few kilobases in length; more preferably, they should be less than 2 kilobases in length; and still more preferably, the fragments should be less than 1.5 kilobases in length.

Preferably, jumping and linking libraries are prepared as described in the following references: Collins et al, Proc. Natl. Acad. Sci., 81: 6812-6816 (1984); Poustka and Lehrach, Genetic Engineering, 10: 169-193 (1988); and Poustka and Lehrach, Trends in Genetics, 2: 174-179 (1986). Briefly, a first and a second restriction endonuclease are selected so that the second restriction endonuclease cleaves a polynucleotide much more frequently than the first restriction endonuclease, e.g. the first restriction endonuclease may recognize a six-basepair sequence and the second restriction endonuclease may recognize a four-basepair sequence. Preferably, the second restriction endonuclease is selected so that there is at least one recognition site for the second restriction endonuclease between every two consecutive recognition sites of the first restriction endonuclease. The polynucleotide is digested with the first restriction endonuclease, after which the restriction fragments are re-ligated at low concentration in the presence of a selectable marker, so that single-fragment circles with a selectable marker are the predominant ligation product. The ligation products are digested with the second restriction endonuclease and the resulting fragments are inserted into a first cloning vector. The selectable marker must not contain recognition sites of the second endonuclease. Clones selected by the marker form a jumping library, so-named because the inserts of the clones contain sequences adjacent to consecutive recognition sites of the first restriction endonuclease and of the immediately neighboring recognition sites of the second restriction endonuclease, but everything else has been deleted, or “jumped” over, effectively resulting in a configuration of alternating recognition sites, similar to that of FIGS. 1A-1D.

Separately the polynucleotide is digested with the second endonuclease, after which the restriction fragments are re-ligated at low concentration in the presence of a selectable marker, again so that single-fragment circles with a selectable marker are the predominant ligation product. Clones selected by the marker form a linking library, so-named because the inserts of the clones contain sequences adjacent to recognition sites of the second restriction endonuclease immediately upstream and downstream of a recognition site of the first restriction endonuclease; thus, it has sequences common to, or “linking,” consecutive recognition sites of the first restriction endonuclease.

Once the jumping and linking libraries are constructed, a physical map may be made by excising the inserts of the first and second plasmids and by carrying out the process described for the embodiment of FIGS. 1A-1D; namely, tagging, cloning, sampling, and sorting, in accordance with Brenner et al. (cited below), followed by sequencing, digesting, and sequencing to form ordered pairs of sequences, which are assembled into a physical map.

The number of nucleotides identified in the regions adjacent to each restriction site depends on the size of the polynucleotide being mapped and the number of fragments generated by the restriction digests. Preferably, a sufficient number of nucleotides are identified so that each of the determined sequences is unique, so as to avoid ambiguous solutions when ordered pairs are assembled into a physical map. Thus, for cosmid-sized polynucleotides cleaved with a restriction endonuclease that recognizes a four basepair sequence (a “4-cutter”), about 160 (≈40,000/256) fragments are produced on average, so the number of nucleotides determined could be as low as five. If the target polynucleotide is a bacterial genome of 1 megabase for the same restriction endonuclease, about 4000 fragments are generated (or about 8000 ends) and the number of nucleotides determined could be as low as seven, and still have a significant probability that each end sequence would be unique. Preferably, for polynucleotides less than or equal to 10 megabases, at least 9 nucleotides are determined in the regions adjacent to restriction sites, when a 4-cutter restriction endonuclease is employed. Generally for polynucleotides less than or equal to 10 megabases, 12-18 nucleotides are preferably determined to ensure that the end sequences are unique. For polynucleotides greater than 10 megabases, from 18-24 nucleotides are preferably determined.
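The arithmetic behind these figures can be reproduced with a short Python sketch. It assumes a random sequence with equal base frequencies, so that a site of k basepairs occurs on average once every 4^k basepairs, and it uses a deliberately crude lower bound for the signature length: the smallest n for which the 4^n possible sequences are at least as numerous as the fragment ends. Both the criterion and the function names are assumptions made for illustration; the preferred lengths given in the text are longer.

```python
def expected_fragments(length_bp, site_length=4):
    """Average number of fragments, assuming one site per 4**site_length bp."""
    return length_bp / 4 ** site_length


def min_signature_length(length_bp, site_length=4):
    """Smallest n such that 4**n is at least the number of fragment ends.

    A crude lower bound only; it does not guarantee that every end is unique.
    """
    ends = 2 * expected_fragments(length_bp, site_length)
    n = 1
    while 4 ** n < ends:
        n += 1
    return n


for size in (40_000, 1_000_000, 10_000_000):
    print(size, expected_fragments(size), min_signature_length(size))
# prints:
# 40000 156.25 5
# 1000000 3906.25 7
# 10000000 39062.5 9
```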

Determination of Restriction Fragment Sequences by Massively Parallel Signature Sequencing (MPSS)

Preferably, ordered pairs of sequences are obtained from restriction fragments by MPSS, which is a combination of two techniques: one for tagging and sorting fragments of DNA for parallel processing (e.g. Brenner et al., PCT Pubn. No. WO 96/41011), and another for the stepwise sequencing of the end of a DNA fragment (e.g. Brenner, U.S. Pat. No. 5,599,675). After an initial digestion of a target polynucleotide with a first restriction endonuclease, restriction fragments are ligated to oligonucleotide tags as described below, and in Brenner et al., PCT Pubn. No. WO 96/41011, so that the resulting tag-fragment conjugates may be sampled, amplified, and sorted onto separate solid phase supports by specific hybridization of the oligonucleotide tags with their tag complements.

Once an amplified sample of fragments is sorted onto solid phase supports to form homogeneous populations of substantially identical fragments, the ends of the fragments are preferably sequenced with an adaptor-based method of DNA sequencing that includes repeated cycles of ligation, identification, and cleavage, such as the method described in Brenner, U.S. Pat. No. 5,599,675. In further preference, adaptors used in the sequencing method each have a protruding strand and an oligonucleotide tag selected from a minimally cross-hybridizing set of oligonucleotides (described more fully below). Such adaptors are referred to herein as “encoded adaptors.” Encoded adaptors whose protruding strands form perfectly matched duplexes with the complementary protruding strands of a fragment are ligated. After ligation, the identity and ordering of the nucleotides in the protruding strand is determined, or “decoded,” by specifically hybridizing a labeled tag complement, or “de-coder,” to its corresponding tag on the ligated adaptor.

The preferred sequencing method is carried out with the following steps: (a) ligating an encoded adaptor to an end of a fragment, the encoded adaptor having a nuclease recognition site of a nuclease whose cleavage site is separate from its recognition site; (b) identifying one or more nucleotides at the end of the fragment by the identity of the encoded adaptor ligated thereto; (c) cleaving the fragment with a nuclease recognizing the nuclease recognition site of the encoded adaptor such that the fragment is shortened by one or more nucleotides; and (d) repeating said steps (a) through (c) until said nucleotide sequence of the end of the fragment is determined. In the identification step, successive sets of tag complements, or “de-coders,” are specifically hybridized to the respective tags carried by encoded adaptors ligated to the ends of the fragments. The type and sequence of nucleotides in the protruding strands of the polynucleotides are identified by the label carried by the specifically hybridized de-coder and the set from which the de-coder came, as described below.
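The cycle of steps (a) through (d) can be abstracted in a short Python sketch. It is a toy model under several assumptions and does not represent the chemistry: the fragment end is a plain string read from its terminus, each cycle exposes a protruding strand of n nucleotides, one hypothetical encoded adaptor exists for every possible protruding strand, and cleavage shortens the fragment by exactly n nucleotides. The names `build_adaptors` and `read_end`, and the tag labels, are invented for illustration.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def build_adaptors(n):
    """Map each possible n-nucleotide protruding strand to a hypothetical tag label."""
    return {"".join(p): f"tag{i}" for i, p in enumerate(product("ACGT", repeat=n))}


def read_end(fragment_end, n=4, cycles=3):
    """Toy ligate / identify / cleave loop returning the nucleotides called."""
    adaptors = build_adaptors(n)
    decoder = {tag: strand for strand, tag in adaptors.items()}
    called = []
    for _ in range(cycles):
        protruding = fragment_end[:n]
        if len(protruding) < n:
            break
        # (a) only the adaptor with a perfectly matched protruding strand ligates
        tag = adaptors[reverse_complement(protruding)]
        # (b) the tag identifies the adaptor, and hence the fragment's nucleotides
        called.append(reverse_complement(decoder[tag]))
        # (c) cleavage shortens the fragment, exposing the next nucleotides
        fragment_end = fragment_end[n:]
    return "".join(called)


print(read_end("ACGTTGCAAGGCTT"))  # prints ACGTTGCAAGGC
```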

Oligonucleotide Tags and Tag Complements

Oligonucleotide tags are employed for two different purposes in the preferred embodiments of the invention: Oligonucleotide tags are employed as described in Brenner, U.S. Pat. No. 5,604,097; and PCT Pubn. No. WO 96/41011, to sort large numbers of polynucleotides, e.g. several thousand to several hundred thousand, from a mixture into uniform populations of identical polynucleotides for analysis, and they are employed to deliver labels to encoded adaptors that number in the range of a few tens to a few thousand. For the former use, large numbers, or repertoires, of tags are typically required, and therefore synthesis of individual oligonucleotide tags is problematic. In these embodiments, combinatorial synthesis of the tags is preferred. On the other hand, where extremely large repertoires of tags are not required, such as for delivering labels to encoded adaptors, oligonucleotide tags of a minimally cross-hybridizing set may be separately synthesized, as well as synthesized combinatorially.

Sets containing several hundred to several thousands, or even several tens of thousands, of oligonucleotides may be synthesized directly by a variety of parallel synthesis approaches, e.g. as disclosed in Frank et al., U.S. Pat. No. 4,689,405; Frank et al., Nucleic Acids Research 11: 4365-4377 (1983); Matson et al., Anal. Biochem. 224: 110-116 (1995); Fodor et al., PCT Pubn. No. WO 93/22684; Pease et al., Proc. Natl. Acad. Sci. 91: 5022-5026 (1994); Southern et al., J. Biotechnology 35: 217-227 (1994); Brennan, PCT Pubn. No. WO 94/27719; Lashkari et al., Proc. Natl. Acad. Sci. 92: 7912-7915 (1995); or the like.

Preferably, tag complements in mixtures, whether synthesized combinatorially or individually, are selected to have similar duplex or triplex stabilities to one another so that perfectly matched hybrids have similar or substantially identical melting temperatures. This permits mis-matched tag complements to be more readily distinguished from perfectly matched tag complements when applied to encoded adaptors, e.g. by washing under stringent conditions. For combinatorially synthesized tag complements, minimally cross-hybridizing sets may be constructed from subunits that make approximately equivalent contributions to duplex stability as every other subunit in the set. Guidance for carrying out such selections is provided by published techniques for selecting optimal PCR primers and calculating duplex stabilities, e.g. Rychlik et al, Nucleic Acids Research, 17: 8543-8551 (1989) and 18: 6409-6412 (1990); Breslauer et al, Proc. Natl. Acad. Sci., 83: 3746-3750 (1986); Wetmur, Crit. Rev. Biochem. Mol. Biol., 26: 227-259 (1991); and the like. When smaller numbers of oligonucleotide tags are required, such as for delivering labels to encoded adaptors, the computer programs of Appendices I and II may be used to generate and list the sequences of minimally cross-hybridizing sets of oligonucleotides that are used directly (i.e. without concatenation into “sentences”). Such lists can be further screened for additional criteria, such as GC-content, distribution of mismatches, theoretical melting temperature, and the like, to form additional minimally cross-hybridizing sets.
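One simple way to build a set of words in which every pair differs at several positions is a greedy search, sketched below in Python. This is not the program of Appendices I and II; it is an independent illustration. The alphabet (A, G, T), the word length of four, and the threshold of three mismatches are assumptions chosen for the example, and the function names are hypothetical. The sketch considers sequence mismatches only and ignores duplex stability.

```python
from itertools import combinations, product


def mismatches(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_word_set(alphabet="AGT", length=4, min_mismatches=3):
    """Greedily keep each word that differs from every kept word
    in at least `min_mismatches` positions."""
    chosen = []
    for letters in product(alphabet, repeat=length):
        word = "".join(letters)
        if all(mismatches(word, kept) >= min_mismatches for kept in chosen):
            chosen.append(word)
    return chosen


words = greedy_word_set()
# Every pair in the resulting set meets the mismatch threshold.
assert all(mismatches(a, b) >= 3 for a, b in combinations(words, 2))
print(len(words), words)
```

A greedy pass is not guaranteed to find the largest possible set, and a list produced this way would still need the further screening for melting temperature and GC-content described above.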

For shorter tags, e.g. about 30 nucleotides or less, the algorithm described by Rychlik and Wetmur is preferred for calculating duplex stability, and for longer tags, e.g. about 30-35 nucleotides or greater, an algorithm disclosed by Suggs et al, pages 683-693 in Brown, editor, ICN-UCLA Symp. Dev. Biol., Vol. 23 (Academic Press, New York, 1981) may be conveniently employed. Clearly, there are many approaches available to one skilled in the art for designing sets of minimally cross-hybridizing subunits within the scope of the invention. For example, to minimize the effects of different base-stacking energies of terminal nucleotides when subunits are assembled, subunits may be provided that have the same terminal nucleotides. In this way, when subunits are linked, the sum of the base-stacking energies of all the adjoining terminal nucleotides will be the same, thereby reducing or eliminating variability in tag melting temperatures.

The oligonucleotide tags of the invention and their complements are conveniently synthesized on an automated DNA synthesizer, e.g. an Applied Biosystems, Inc. (Foster City, Calif.) model 392 or 394 DNA/RNA Synthesizer, using standard chemistries, such as phosphoramidite chemistry, e.g. disclosed in the following references: Beaucage and Iyer, Tetrahedron, 48: 2223-2311 (1992); Molko et al, U.S. Pat. No. 4,980,460; Koster et al, U.S. Pat. No. 4,725,677; Caruthers et al, U.S. Pat. Nos. 4,415,732; 4,458,066; and 4,973,679; and the like.

Oligonucleotide tags for sorting may range in length from 12 to 60 nucleotides or basepairs. Preferably, oligonucleotide tags range in length from 18 to 40 nucleotides or basepairs. More preferably, oligonucleotide tags range in length from 25 to 40 nucleotides or basepairs. In terms of preferred and more preferred numbers of subunits, these ranges may be expressed as follows:

TABLE III
Numbers of Subunits in Tags in Preferred Embodiments

Monomers      Nucleotides in Oligonucleotide Tag
in Subunit    (12-60)          (18-40)          (25-40)
3             4-20 subunits    6-13 subunits    8-13 subunits
4             3-15 subunits    4-10 subunits    6-10 subunits
5             2-12 subunits    3-8 subunits     5-8 subunits
6             2-10 subunits    3-6 subunits     4-6 subunits
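The subunit counts in Table III appear to follow from dividing each tag-length bound by the number of monomers per subunit and discarding the remainder. The Python sketch below makes that arithmetic explicit; the rounding rule is inferred from the table entries rather than stated in the text, and the function name `subunit_range` is hypothetical.

```python
def subunit_range(min_length, max_length, monomers_per_subunit):
    """Whole subunits that fit in the shortest and longest tags of a range."""
    return (min_length // monomers_per_subunit,
            max_length // monomers_per_subunit)


for monomers in (3, 4, 5, 6):
    row = [subunit_range(lo, hi, monomers) for lo, hi in ((12, 60), (18, 40), (25, 40))]
    print(monomers, row)
# prints:
# 3 [(4, 20), (6, 13), (8, 13)]
# 4 [(3, 15), (4, 10), (6, 10)]
# 5 [(2, 12), (3, 8), (5, 8)]
# 6 [(2, 10), (3, 6), (4, 6)]
```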

Most preferably, oligonucleotide tags for sorting are single stranded and specific hybridization occurs via Watson-Crick pairing with a tag complement.

Preferably, repertoires of single stranded oligonucleotide tags for sorting contain at least 100 members; more preferably, repertoires of such tags contain at least 1000 members; and most preferably, repertoires of such tags contain at least 10,000 members.

Preferably, repertoires of tag complements for delivering labels contain at least 16 members; more preferably, repertoires of such tags contain at least 64 members. Still more preferably, such repertoires of tag complements contain from 16 to 1024 members, e.g. a number for identifying nucleotides in protruding strands of from 2 to 5 nucleotides in length. Most preferably, such repertoires of tag complements contain from 64 to 256 members. Repertoires of desired sizes are selected by directly generating sets of words, or subunits, of the desired size, e.g. with the help of the computer programs disclosed by Brenner et al (cited above), or repertoires are formed by generating a set of words which are then used in a combinatorial synthesis scheme to give a repertoire of the desired size. Preferably, the length of single stranded tag complements for delivering labels is between 8 and 20. More preferably, the length is between 9 and 15.

In embodiments where specific hybridization occurs via triplex formation, coding of tag sequences follows the same principles as for duplex-forming tags; however, there are further constraints on the selection of subunit sequences. Generally, third strand association via Hoogsteen type of binding is most stable along homopyrimidine-homopurine tracks in a double stranded target. Usually, base triplets form in T-A*T or C-G*C motifs (where “-” indicates Watson-Crick pairing and “*” indicates Hoogsteen type of binding); however, other motifs are also possible. For example, Hoogsteen base pairing permits parallel and antiparallel orientations between the third strand (the Hoogsteen strand) and the purine-rich strand of the duplex to which the third strand binds, depending on conditions and the composition of the strands. There is extensive guidance in the literature for selecting appropriate sequences, orientation, conditions, nucleoside type (e.g. whether ribose or deoxyribose nucleosides are employed), base modifications (e.g. methylated cytosine, and the like) in order to maximize, or otherwise regulate, triplex stability as desired in particular embodiments. Conditions for annealing single-stranded or duplex tags to their single-stranded or duplex complements are well known, e.g. Ji et al, Anal. Chem. 65: 1323-1328 (1993); Cantor et al, U.S. Pat. No. 5,482,836; and the like. Use of triplex tags in sorting has the advantage of not requiring a “stripping” reaction with polymerase to expose the tag for annealing to its complement.

An exemplary tag library for sorting is constructed as follows. A mixture of 8-word tags of nucleotides A, G, and T is chemically synthesized in accordance with the formula:

3′-AATT-[⁴(A,G,T)₈]-CCCTp

where “[⁴(A,G,T)₈]” indicates a tag mixture where each tag consists of eight 4-mer words of A, G, and T; and “p” indicates a 5′ phosphate. This mixture is ligated to the following right and left primer binding regions (SEQ ID NO: 1 & 2):

5′- AGAATTCGGGCCTTAATTAA        5′- GGGTACCAAGTCAGAGTGAT
    TCACCGACCCGGAATTp                   TGGTTCAGTCTCACTA
           LEFT                              RIGHT

The right and left primer binding regions are ligated to the above tag mixture, after which the single stranded portion of the ligated structure is filled in with DNA polymerase, then mixed with the right and left primers indicated below and amplified to give a tag library.

Formula I

     Left Primer
5′- AGAATTCGGGCCTTAATTAA                 Kpn I
                                           ↓
5′- AGAATTCGGGCCTTAATTAA- [⁴(A,C,T)₈]-GGGTACCAAGTCAGAGTGAT
    TCTTAAGCCCGGAATTAATT- [⁴(T,G,A)₈]-CCCATGGTTCAGTCTCACTA
      ↑             ↑                 CCCATGGTTCAGTCTCACTA -5′
    EcoRI         Pac I                   Right Primer

The flanking regions of the oligonucleotide tag may be engineered to contain restriction sites, as exemplified above, for convenient insertion into and excision from cloning vectors. Optionally, the right or left primers may be synthesized with a biotin attached (using conventional reagents, e.g. available from Clontech Laboratories, Palo Alto, Calif.) to facilitate purification after amplification and/or cleavage. Preferably, for making tag-fragment conjugates, the above library is inserted into a conventional cloning vector, such as pUC19, or the like.

A general method for exposing the single stranded tag involves digesting tag-fragment conjugates with the 3′→5′ exonuclease activity of T4 DNA polymerase, or a like enzyme. When used in the presence of a single deoxynucleoside triphosphate, such a polymerase will cleave nucleotides from 3′ ends present on the non-template strand of a double stranded fragment until a complement of the single deoxynucleoside triphosphate is reached on the template strand. When such a nucleotide is reached the 3′→5′ digestion effectively ceases, as the polymerase's extension activity adds nucleotides at a higher rate than the excision activity removes nucleotides. Consequently, single stranded tags constructed with three nucleotides are readily prepared for loading onto solid phase supports.
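The stopping behaviour described above can be illustrated with a minimal sketch. It assumes an idealized enzyme that trims a strand from its 3′ end until the terminal base is the same as the single nucleotide supplied; the function name and the example sequences are hypothetical and are not taken from the tag library above.

```python
def strip_3prime_end(strand_5to3: str, dntp_base: str) -> str:
    """Idealized model of exonuclease trimming with one dNTP present.

    Bases are removed from the 3' end until the 3'-terminal base equals
    the supplied nucleotide; there removal and re-addition balance and
    net digestion stops. Real reactions are not this clean.
    """
    strand = strand_5to3.upper()
    end = len(strand)
    while end > 0 and strand[end - 1] != dntp_base.upper():
        end -= 1
    return strand[:end]


# Hypothetical strand: a flank ending in G, then a tag built only from A, C, T.
flank = "CCTTAATTAAGG"
tag = "ACTTCATA"

remaining = strip_3prime_end(flank + tag, "G")
exposed_length = len(flank + tag) - len(remaining)
print(remaining, exposed_length)
```

Because the hypothetical tag contains no G, trimming in the presence of dGTP runs through the whole tag and halts at the first G of the flank, which is why a three-nucleotide tag alphabet makes the exposed region well defined.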

The technique may also be used to preferentially methylate interior IIs sites of a fragment while leaving a single IIs site at the terminus of the fragment unmethylated. First, the terminal IIs site is rendered single stranded using a polymerase with, e.g., deoxycytidine triphosphate. The double stranded portion of the fragment is then methylated, after which the single stranded terminus is filled in with a DNA polymerase in the presence of all four nucleoside triphosphates, thereby regenerating the IIs site.

Use of Encoded Adaptors for Base-by-base Sequencing

Preferably, encoded adaptors are used in the sequencing method described in Brenner U.S. Pat. No. 5,599,675. Each encoded adaptor comprises a protruding strand and an oligonucleotide tag selected from a minimally cross-hybridizing set of oligonucleotides. Encoded adaptors whose protruding strands form perfectly matched duplexes with the complementary protruding strands of the target polynucleotide are ligated. After ligation, the identity and ordering of the nucleotides in the protruding strands are determined, or “decoded,” by specifically hybridizing a labeled tag complement to its corresponding tag on the ligated adaptor. As used herein, the term “de-coder” refers to labeled tag complements used in connection with encoded adaptors.

For example, if an encoded adaptor with a protruding strand of four nucleotides, say 5′-AGGT, forms a perfectly matched duplex with the complementary protruding strand of a target polynucleotide and is ligated, the four complementary nucleotides, 3′-TCCA, on the polynucleotide may be identified by a unique oligonucleotide tag selected from a set of 256 such tags, one for every possible four nucleotide sequence of the protruding strands. Tag complements are applied to the ligated adaptors under conditions which allow specific hybridization of only those tag complements that form perfectly matched duplexes (or triplexes) with the oligonucleotide tags of the ligated adaptors. The tag complements may be applied individually or as one or more mixtures to determine the identity of the oligonucleotide tags, and therefore, the sequences of the protruding strands.
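The one-tag-per-protruding-strand scheme in this example can be sketched as a lookup table. The tag labels below (`tag_000` and so on) are hypothetical placeholders for real tag sequences, and the mapping order is an arbitrary assumption made only for illustration.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(seq: str) -> str:
    """Base-by-base complement; read it in the opposite polarity to seq."""
    return "".join(COMPLEMENT[base] for base in seq)


# Every possible four-nucleotide protruding strand: 4**4 of them.
protruding_strands = ["".join(p) for p in product("ACGT", repeat=4)]

# Hypothetical tag labels, one per protruding strand.
tag_of = {strand: f"tag_{i:03d}" for i, strand in enumerate(protruding_strands)}
strand_of = {tag: strand for strand, tag in tag_of.items()}


def decode(tag_label: str) -> tuple[str, str]:
    """Return (adaptor strand 5'->3', target strand 3'->5') for a detected tag."""
    adaptor_strand = strand_of[tag_label]
    return adaptor_strand, complement(adaptor_strand)


print(len(protruding_strands))
print(decode(tag_of["AGGT"]))
```

Detecting the tag assigned to 5′-AGGT therefore identifies 3′-TCCA on the target, as in the example in the text.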

Encoded adaptors can have several embodiments depending, for example, on whether single or double stranded tags are used, whether multiple tags are used, whether a 5′ protruding strand or 3′ protruding strand is employed, whether a 3′ blocking group is used, and the like. Formulas for several embodiments of encoded adaptors are shown below. Preferred structures for encoded adaptors using one single stranded tag are as follows:

5′-p(N)_(n)(N)_(r)(N)_(s)(N)_(q)(N)_(t)-3′
          z(N′)_(r)(N′)_(s)(N′)_(q)-5′

or

          p(N)_(r)(N)_(s)(N)_(q)(N)_(t)-3′
3′-z(N)_(n)(N′)_(r)(N′)_(s)(N′)_(q)-5′

where N is a nucleotide and N′ is its complement, p is a phosphate group, z is a 3′ hydroxyl or a 3′ blocking group, n is an integer between 2 and 6, inclusive, r is an integer greater than or equal to 0, s is an integer which is either between four and six whenever the encoded adaptor has a nuclease recognition site or is 0 whenever there is no nuclease recognition site, q is an integer greater than or equal to 0, and t is an integer between 8 and 20, inclusive. More preferably, n is 4 or 5, and t is between 9 and 15, inclusive. Whenever an encoded adaptor contains a nuclease recognition site, the region of “r” nucleotide pairs is selected so that a predetermined number of nucleotides are cleaved from a target polynucleotide whenever the nuclease recognizing the site is applied. The size of “r” in a particular embodiment depends on the reach of the nuclease (as the term is defined in U.S. Pat. No. 5,599,675) and the number of nucleotides sought to be cleaved from the target polynucleotide. Preferably, r is between 0 and 20; more preferably, r is between 0 and 12. The region of “q” nucleotide pairs is a spacer segment between the nuclease recognition site and the tag region of the encoded adaptor. The region of “q” nucleotide pairs may include further nuclease recognition sites, labeling or signal generating moieties, or the like. The single stranded oligonucleotide of “t” nucleotides is a “t-mer” oligonucleotide tag selected from a minimally cross-hybridizing set.
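The parameter ranges just stated for the single stranded tag embodiment can be collected into a small validity check. This is only a sketch: the class name, field names, and the example values are hypothetical, and only the broad ranges (not the "more preferably" ranges) are enforced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SingleTagAdaptorSpec:
    """Segment lengths of an encoded adaptor with one single stranded tag."""
    n: int  # protruding strand
    r: int  # pairs between protruding strand and recognition site
    s: int  # nuclease recognition site (0 if absent)
    q: int  # spacer pairs
    t: int  # single stranded tag
    has_nuclease_site: bool = True

    def __post_init__(self) -> None:
        if not 2 <= self.n <= 6:
            raise ValueError("n must be between 2 and 6, inclusive")
        if self.r < 0 or self.q < 0:
            raise ValueError("r and q must be greater than or equal to 0")
        if self.has_nuclease_site and not 4 <= self.s <= 6:
            raise ValueError("s must be between 4 and 6 when a site is present")
        if not self.has_nuclease_site and self.s != 0:
            raise ValueError("s must be 0 when there is no recognition site")
        if not 8 <= self.t <= 20:
            raise ValueError("t must be between 8 and 20, inclusive")

    @property
    def tagged_strand_length(self) -> int:
        """Length of the strand carrying the protruding strand and the tag."""
        return self.n + self.r + self.s + self.q + self.t

    @property
    def duplex_length(self) -> int:
        """Number of base pairs in the double stranded portion."""
        return self.r + self.s + self.q


# Illustrative values only.
spec = SingleTagAdaptorSpec(n=4, r=4, s=5, q=5, t=12)
print(spec.tagged_strand_length, spec.duplex_length)
```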

The 3′ blocking group “z” may have a variety of forms and may include almost any chemical entity that prevents inter-adaptor ligation and that does not interfere with other steps of the method, e.g. removal of the 3′ blocked strand, ligation, or the like. Exemplary 3′ blocking groups include, but are not limited to, hydrogen (i.e. 3′ deoxy), phosphate, phosphorothioate, acetyl, and the like. Preferably, the 3′ blocking group is a phosphate because of the convenience in adding the group during the synthesis of the 3′ blocked strand and the convenience in removing the group with a phosphatase to render the strand capable of ligation with a ligase. An oligonucleotide having a 3′ phosphate may be synthesized using the protocol described in chapter 12 of Eckstein, Editor, Oligonucleotides and Analogues: A Practical Approach (IRL Press, Oxford, 1991).

Further 3′ blocking groups are available from the chemistries developed for reversible chain terminating nucleotides in base-by-base sequencing schemes, e.g. disclosed in the following references: Cheeseman, U.S. Pat. No. 5,302,509; Tsien et al, International application WO 91/06678; Canard et al, Gene, 148: 1-6 (1994); and Metzker et al, Nucleic Acids Research, 22: 4259-4267 (1994). Roughly, these chemistries permit the chemical or enzymatic removal of specific blocking groups (usually having an appended label) to generate a free hydroxyl at the 3′ end of a priming strand.

Preferably, when z is a 3′ blocking group, it is a phosphate group and the double stranded portion of the adaptor contains a nuclease recognition site of a nuclease whose recognition site is separate from its cleavage site.

When double stranded oligonucleotide tags are employed that specifically hybridize with single stranded tag complements to form triplex structures, encoded adaptors of the invention preferably have the following form:

5′-p(N)_(n)(N)_(r)(N)_(s)(N)_(q)(N)_(t)-3′
          z(N′)_(r)(N′)_(s)(N′)_(q)(N′)_(t)-5′

or

          p(N)_(r)(N)_(s)(N)_(q)(N)_(t)-3′
3′-z(N)_(n)(N′)_(r)(N′)_(s)(N′)_(q)(N′)_(t)-5′

where N, N′, p, q, r, s, z, and n are defined as above. Preferably, in this embodiment t is an integer in the range of 12 to 40.

Clearly, there are additional structures which contain elements of the basic designs set forth above that would be apparent to those with skill in the art. For example, encoded adaptors of the invention include embodiments with multiple tags, such as the following:

5′-p(N)_(n)(N)_(r)(N)_(s)(N)_(q)(N)_(t1) . . . (N)_(tk)-3′
          z(N′)_(r)(N′)_(s)(N′)_(q)(N′)_(t1) . . . (N′)_(tk)-5′

or

          p(N)_(r)(N)_(s)(N)_(q)(N)_(t1) . . . (N)_(tk)-3′
3′-z(N)_(n)(N′)_(r)(N′)_(s)(N′)_(q)(N′)_(t1) . . . (N′)_(tk)-5′

where the encoded adaptor includes k double stranded tags. Preferably, t₁=t₂= . . . =t_(k) and k is either 1, 2, or 3.

The tag complements of the invention can be labeled in a variety of ways for decoding oligonucleotide tags, including the direct or indirect attachment of radioactive moieties, fluorescent moieties, colorimetric moieties, chemiluminescent moieties, and the like. Many comprehensive reviews of methodologies for labeling DNA and constructing DNA adaptors provide guidance applicable to constructing adaptors of the present invention. Such reviews include Matthews et al, Anal. Biochem., Vol 169, pgs. 1-25 (1988); Haugland, Handbook of Fluorescent Probes and Research Chemicals (Molecular Probes, Inc., Eugene, 1992); Keller and Manak, DNA Probes, 2nd Edition (Stockton Press, New York, 1993); Eckstein, editor, Oligonucleotides and Analogues: A Practical Approach (IRL Press, Oxford, 1991); Wetmur, Critical Reviews in Biochemistry and Molecular Biology, 26: 227-259 (1991); and the like. Many more particular methodologies applicable to the invention are disclosed in the following sample of references: Fung et al, U.S. Pat. No. 4,757,141; Hobbs, Jr., et al, U.S. Pat. No. 5,151,507; Cruickshank, U.S. Pat. No. 5,091,519 (synthesis of functionalized oligonucleotides for attachment of reporter groups); Jablonski et al, Nucleic Acids Research, 14: 6115-6128 (1986) (enzyme-oligonucleotide conjugates); Ju et al, Nature Medicine, 2: 246-249 (1996); and Urdea et al, U.S. Pat. No. 5,124,246 (branched DNA). Attachment sites of labeling moieties are not critical, provided that such labels do not interfere with the ligation and/or cleavage steps.

Preferably, one or more fluorescent dyes are used as labels for tag complements, e.g. as disclosed by Menchen et al., U.S. Pat. No. 5,188,934; Bergot et al., PCT Pubn. No. WO 91/05060. As used herein, the term “fluorescent signal generating moiety” means a signaling means which conveys information through the fluorescent absorption and/or emission properties of one or more molecules. Such fluorescent properties include fluorescence intensity, fluorescence life time, emission spectrum characteristics, energy transfer, and the like.

Attaching Tags to Restriction Fragments for Sorting onto Solid Phase Supports

An important aspect of the invention is the sorting and attachment of populations of DNA fragments, e.g. from a restriction digest, to microparticles or to separate regions on a solid phase support such that each microparticle or region has substantially only one kind of fragment attached. This objective is accomplished by ensuring that substantially all different fragments have different tags attached. This condition, in turn, is brought about by taking a sample of the full ensemble of tag-fragment conjugates for analysis. (It is acceptable that identical fragments have different tags, as it merely results in the same fragment being operated on or analyzed twice in two different locations.) Such sampling can be carried out overtly after the tags have been attached to the fragments (for example, by taking a small volume from a larger mixture), it can be carried out inherently as a secondary effect of the techniques used to process the fragments and tags, or it can be carried out both overtly and as an inherent part of processing steps.

If a sample of n tag-fragment conjugates is randomly drawn from a reaction mixture, as could be effected by taking a sample volume, the probability of drawing conjugates having the same tag is described by the Poisson distribution, P(r)=e^(−λ)(λ)^(r)/r!, where r is the number of conjugates having the same tag and λ=np, where p is the probability of a given tag being selected. If n=10⁶ and p=1/(1.67×10⁷) (for example, if eight 4-base words described in Brenner et al were employed as tags), then λ=np≈0.060 and P(2)≈1.7×10⁻³. Thus, a sample of one million molecules gives rise to an expected number of doubles within the preferred range. Such a sample is readily obtained by serial dilutions of a mixture containing tag-fragment conjugates.
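The Poisson estimate can be evaluated directly from n and p so that the quoted figures can be checked. The sketch below assumes a repertoire of exactly 8⁸ equally likely tags (about 1.67×10⁷); the variable names are illustrative.

```python
import math


def poisson_pmf(r: int, lam: float) -> float:
    """P(r) = exp(-lam) * lam**r / r! for a Poisson distribution."""
    return math.exp(-lam) * lam ** r / math.factorial(r)


n = 10 ** 6             # sampled tag-fragment conjugates
repertoire = 8 ** 8     # assumed number of equally likely tags
p = 1 / repertoire      # probability that a given tag is selected
lam = n * p             # expected number of conjugates carrying a given tag

p_single = poisson_pmf(1, lam)   # a given tag is drawn exactly once
p_double = poisson_pmf(2, lam)   # a given tag is drawn exactly twice

print(f"lambda = {lam:.4f}")
print(f"P(1)   = {p_single:.4e}")
print(f"P(2)   = {p_double:.4e}")
print(f"expected tags drawn twice = {repertoire * p_double:.0f}")
```

Lowering n (a more dilute sample) reduces λ in proportion and reduces P(2) roughly with the square of n, which is why serial dilution is an effective way to control the number of doubles.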

As used herein, the term “substantially all” in reference to attaching tags to molecules, especially polynucleotides, is meant to reflect the statistical nature of the sampling procedure employed to obtain a population of tag-molecule conjugates essentially free of doubles. The meaning of substantially all in terms of actual percentages of tag-molecule conjugates depends on how the tags are being employed. Preferably, for nucleic acid sequencing, substantially all means that at least eighty percent of the polynucleotides have unique tags attached. More preferably, it means that at least ninety percent of the polynucleotides have unique tags attached. Still more preferably, it means that at least ninety-five percent of the polynucleotides have unique tags attached.

Preferably, restriction fragments are conjugated to oligonucleotide tags by inserting the fragments into a conventional cloning vector carrying a tag library. For example, a pUC19 plasmid may be prepared for accepting the tag library of Formula I as follows: Into a Bam HI/Sac I-digested pUC19 the following adaptor (SEQ ID NO: 3) is ligated to introduce a Pac I site:

5′-CTTAATTAAG-3′

3′-TCGAGAATTAATTCCTAG-5′

After the recombinant plasmid is cloned and isolated, fragments from a Sau 3A-digested target polynucleotide may be inserted into the Bam HI site to form a tag-fragment library, which includes every possible tag-fragment pairing. A sample is taken from this library for amplification and sorting. Sampling may be accomplished by serial dilutions of the library, or by simply picking plasmid-containing bacterial hosts from colonies. After amplification, the tag-fragment conjugates may be excised from the plasmid by Pac I/Xba I digestion. The residual Pac I site allows the oligonucleotide tag to be rendered single stranded by T4 DNA polymerase digestion in the presence of dGTP.

After the oligonucleotide tags are prepared for specific hybridization, e.g. by rendering them single stranded as described above, the polynucleotides are mixed with microparticles containing the complementary sequences of the tags under conditions that favor the formation of perfectly matched duplexes between the tags and their complements. There is extensive guidance in the literature for creating these conditions. Exemplary references providing such guidance include Wetmur, Critical Reviews in Biochemistry and Molecular Biology, 26: 227-259 (1991); Sambrook et al, Molecular Cloning: A Laboratory Manual, 2nd Edition (Cold Spring Harbor Laboratory, New York, 1989); and the like. Preferably, the hybridization conditions are sufficiently stringent so that only perfectly matched sequences form stable duplexes. Under such conditions the polynucleotides specifically hybridized through their tags may be ligated to the complementary sequences attached to the microparticles. Finally, the microparticles are washed to remove polynucleotides with unligated and/or mismatched tags.

Preferably, for sequencing applications, standard CPG beads of diameter in the range of 20-50 μm are loaded with about 10⁵ polynucleotides, and glycidal methacrylate (GMA) beads available from Bangs Laboratories (Carmel, Ind.) of diameter in the range of 5-10 μm are loaded with a few tens of thousands of polynucleotides, e.g. 4×10⁴ to 6×10⁴.

Specificity of the hybridizations of tags to their complements may be increased by taking a sufficiently small sample so that both a high percentage of tags in the sample are unique and the nearest neighbors of substantially all the tags in a sample differ by at least two words. This latter condition may be met by taking a sample that contains a number of tag-polynucleotide conjugates that is about 0.1 percent or less of the size of the repertoire being employed. For example, if tags are constructed with eight words, a repertoire of 8⁸, or about 1.67×10⁷, tags and tag complements is produced. In a library of tag-fragment conjugates as described above, a 0.1 percent sample means that about 16,700 different tags are present. If this were loaded directly onto a repertoire-equivalent of microparticles, or in this example a sample of 1.67×10⁷ microparticles, then only a sparse subset of the sampled microparticles would be loaded. The density of loaded microparticles can be increased (for example, for more efficient sequencing) by undertaking a “panning” step in which the sampled tag-fragment conjugates are used to separate loaded microparticles from unloaded microparticles. Thus, in the example above, even though a “0.1 percent” sample contains only 16,700 tag-fragment conjugates, the sampling and panning steps may be repeated until as many loaded microparticles as desired are accumulated. Alternatively, loaded microparticles may be separated from unloaded microparticles by a fluorescence-activated cell sorting (FACS) instrument using conventional protocols after fragments have been fluorescently labeled. After loading and FACS sorting, the label may be cleaved prior to ligating encoded adaptors, e.g. by Dpn I or like enzyme that recognizes methylated sites.
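The arithmetic behind the repertoire size, the 0.1 percent sample, and the repeated sampling and panning rounds can be sketched as follows. The helper name `rounds_needed` and the target of 100,000 loaded microparticles are hypothetical, and the sketch assumes each round contributes one full sample of uniquely loaded microparticles.

```python
import math

WORDS_PER_TAG = 8     # eight word positions per tag
WORD_CHOICES = 8      # eight alternative words at each position

repertoire = WORD_CHOICES ** WORDS_PER_TAG       # 8**8 tags
sample_size = round(0.001 * repertoire)          # a 0.1 percent sample
loaded_fraction = sample_size / repertoire       # share of microparticles loaded


def rounds_needed(target_loaded: int, per_round: int = sample_size) -> int:
    """Sampling-and-panning rounds required to accumulate target_loaded beads."""
    return math.ceil(target_loaded / per_round)


print(repertoire, sample_size, f"{loaded_fraction:.4%}")
print(rounds_needed(100_000))
```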

A panning step may be implemented by providing a sample of tag-fragment conjugates each of which contains a capture moiety at an end opposite, or distal to, the oligonucleotide tag. Preferably, the capture moiety is of a type which can be released from the tag-fragment conjugates, so that the tag-fragment conjugates can be sequenced with a single-base sequencing method. Such moieties may comprise biotin, digoxigenin, or like ligands, a triplex binding region, or the like. Preferably, such a capture moiety comprises a biotin component. Biotin may be attached to tag-fragment conjugates by a number of standard techniques. If appropriate adaptors containing PCR primer binding sites are attached to tag-fragment conjugates, biotin may be attached by using a biotinylated primer in an amplification after sampling. Alternatively, if the tag-fragment conjugates are inserts of cloning vectors, biotin may be attached after excising the tag-fragment conjugates by digestion with an appropriate restriction enzyme followed by isolation and filling in a protruding strand distal to the tags with a DNA polymerase in the presence of biotinylated uridine triphosphate.

After a tag-fragment conjugate is captured, it may be released from the biotin moiety in a number of ways, such as by a chemical linkage that is cleaved by reduction, e.g. Herman et al, Anal. Biochem., 156: 48-55 (1986), or that is cleaved photochemically, e.g. Olejnik et al, Nucleic Acids Research, 24: 361-366 (1996), or that is cleaved enzymatically by introducing a restriction site in the PCR primer.

Physical Map Construction by Partial Methylation

As mentioned above, the invention may be implemented with the use of restriction enzymes which have methyl-sensitive isoschizomers, e.g. Dpn I is a methyl-sensitive isoschizomer of Mbo I, Sau 3A, and Dpn II with respect to dam methylation. That is, Dpn I is able to cleave only a GATC site which is dam-methylated, whereas Mbo I and Dpn II, which also cleave at GATC, are blocked by dam methylation. For such pairs of restriction endonucleases, ordered pairs of sequences may be prepared as shown in FIG. 4. Polynucleotide (400), which contains restriction sites (412), is partially methylated (402), e.g. with a dam methylase (New England Biolabs, Beverly, Mass.), so that the likelihood of adjacent sites being methylated is low. Preferably, methylation of adjacent sites is avoided because double methylated fragments could lead to gaps or ambiguities in the reconstructed map. On the other hand, the partial methylation must be complete enough so that at least one representative of every site is present in methylated form. If sites at some positions are completely unmethylated, then a gap is created in the reconstructed map. Preferably, about 0.5 to about 2 percent of the restriction sites are methylated. Partially methylated polynucleotide (400) is digested with a restriction endonuclease which is blocked from cleaving methylated sites. The resulting fragments are cloned (406) into a conventional cloning vector carrying a repertoire of oligonucleotide tags, after which the cloning vector is expanded and fragment-containing vectors are isolated. After digestion with the methyl-sensitive isoschizomer, a marker fragment, e.g. supF or the like, is inserted into the opened site, the re-circularized vectors are cloned, plated, and selected for the presence of the inserted marker. A sufficiently large sample of marker-containing clones is harvested so that with high probability, preferably greater than 99%, all fragments of the polynucleotide are represented. Preferably, tag-containing fragments are then excised from the vectors and prepared for loading onto microparticles for sequencing, as described above.
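How large a sample of clones is "sufficiently large" can be estimated with a simple bound. The sketch below is an illustration only: it assumes every distinct fragment is equally likely to be picked in each clone, which real libraries will not satisfy exactly, and the fragment count of 200 is a hypothetical input.

```python
import math


def miss_probability_bound(num_fragments: int, sample_size: int) -> float:
    """Union bound on the chance that at least one fragment is absent.

    Each fragment is missed by one clone with probability 1 - 1/F, so it is
    missed by all n clones with probability (1 - 1/F)**n; summing over the F
    fragments bounds the chance that any of them is missing.
    """
    bound = num_fragments * (1 - 1 / num_fragments) ** sample_size
    return min(1.0, bound)


def min_sample_size(num_fragments: int, confidence: float = 0.99) -> int:
    """Smallest n for which the bound guarantees the requested confidence."""
    if num_fragments < 2:
        return 1
    allowed = 1 - confidence
    n = math.ceil(math.log(allowed / num_fragments)
                  / math.log(1 - 1 / num_fragments))
    while miss_probability_bound(num_fragments, n) > allowed:
        n += 1  # guard against floating-point rounding
    return n


clones = min_sample_size(200, confidence=0.99)
print(clones, miss_probability_bound(200, clones))
```

Under these assumptions a few thousand clones already suffice for a few hundred fragments, so unequal representation, rather than sample size, is likely to be the practical limit.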

EXAMPLE 1 Digestion and Loading Restriction Fragments from Phage λ for MPSS Analysis

In this example, aliquots of phage λ DNA are separately digested with Tsp 509 I (recognizing 5′-AATT) and Dpn II (recognizing 5′-GATC). Restriction fragments from the separate digestions are inserted into pUC19 or pUC18 plasmids containing oligonucleotide tag repertoires, thus forming a library of tag-Tsp 509 I fragment conjugates and a library of tag-Dpn II fragment conjugates. Samples of about 10⁵ clones are obtained from each library. (This is more than required to provide an adequate representation of the populations, given that the complexity of the fragment mixture is only about 100-200 for phage λ. Also, the sample size is still small enough so that it is only about 1% of the complexity of the tag library described above, so there is a high probability that each fragment will receive a unique tag.) After sampling, tag-fragment conjugates from the two samples are separately transfected into hosts and expanded in culture, after which the plasmids are isolated. Tag-fragment conjugates are then amplified from the plasmids by 4-5 cycles of PCR in the presence of 5-methyldeoxycytosine triphosphate using appropriate flanking vector sequences as primer binding sites. After amplification, the tags of the tag-fragment conjugates are rendered single stranded and loaded onto microparticles carrying tag complements. Dpn II and Tsp 509 I are selected for being able to cleave DNA whose deoxycytosines are methylated at the 5-carbon position.
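The two separate digestions can be mimicked in silico on a single strand. The sketch below uses a short hypothetical sequence rather than phage λ, and it places each cut immediately 5′ of the recognition sequence, which is an assumption about the cleavage positions of the two enzymes that is not stated in the text.

```python
def cut_positions(seq: str, site: str) -> list[int]:
    """Indices at which the recognition sequence starts (overlaps allowed)."""
    positions = []
    i = seq.find(site)
    while i != -1:
        positions.append(i)
        i = seq.find(site, i + 1)
    return positions


def digest(seq: str, site: str) -> list[str]:
    """Fragments of the top strand after cutting just 5' of every site."""
    bounds = [0] + [c for c in cut_positions(seq, site) if c > 0] + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


# Hypothetical toy sequence containing two sites for each enzyme.
toy = "CCGATCAAGGAATTCTTGATCGGAATTAA"

tsp509_fragments = digest(toy, "AATT")   # Tsp 509 I recognition sequence
dpn2_fragments = digest(toy, "GATC")     # Dpn II recognition sequence

print(tsp509_fragments)
print(dpn2_fragments)
```

Each Dpn II fragment that contains an internal AATT would be cut again by Tsp 509 I, and vice versa, which is the relationship the successive digestions in the method exploit.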

To facilitate the initiation of sequencing after methylation, the following adaptor (SEQ ID NO: 4) is inserted into an Xba I-Sal I digested pUC19:

5′-CTAGAAGCTGCGCTTGCTTTTG
       TTCGACGCGAACGAAAACAGCT

The tag library of Formula I is digested with Eco RI and Kpn I and inserted into the modified pUC19 (New England Biolabs, Beverly, Mass.) which is similarly digested, using conventional protocols. The resulting recombinants are transfected into a suitable host (e.g. preferably, dam⁻, Stratagene, La Jolla, Calif.) and expanded in culture. Tag-pUC19 recombinants isolated from the culture are digested with Bam HI and ligated to Dpn II restriction fragments, after which the resulting recombinant products are again transfected into a host and expanded to form a library of tag-Dpn II fragment conjugates. After isolation, a sample of about 10⁵ tag-Dpn II fragment conjugates is obtained by serial dilution. The sample is re-transfected into fresh host bacteria and expanded in culture. From a standard miniprep of plasmid, the tag-Dpn II fragment conjugates are amplified by PCR with 5-methyldeoxycytosine triphosphate substituted for deoxycytosine triphosphate. The following 19-mer forward and reverse primers (SEQ ID NO: 5 and SEQ ID NO: 6), specific for flanking sequences in pUC19, are used in the reaction:

forward primer: 5′-biotin-GAATTCGGGCCTTAATTAA

reverse primer: 5′-FAM-CAAAAGCAAGCGCAGCTTC

where “FAM” is an NHS ester of fluorescein (Clontech Laboratories, Palo Alto, Calif.) coupled to the 5′ end of the reverse primer via an amino linkage, e.g. Aminolinker II (Perkin-Elmer, Applied Biosystems Division, Foster City, Calif.). The reverse primer is selected so that a Bbv I site without methylated deoxycytosines can be reconstituted. This is accomplished by using a reverse primer whose deoxycytosines are un-methylated and by carrying out a “stripping” reaction with T4 DNA polymerase in the presence of dATP (and absent the other dNTPs).
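The relationship between the reverse primer, the adaptor, and the Bbv I site can be checked with a few string operations. The sketch makes several assumptions that are labeled in the comments: the adaptor top strand is taken to be the first 22 bases of the adaptor listing (SEQ ID NO: 4), the Bbv I recognition sequence is taken to be GCAGC, and stripping is modeled as ideal trimming from the 3′ end until an A is reached.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


ADAPTOR_TOP = "CTAGAAGCTGCGCTTGCTTTTG"   # assumed top strand of the adaptor
REVERSE_PRIMER = "CAAAAGCAAGCGCAGCTTC"   # reverse primer from the text
BBV_I_SITE = "GCAGC"                     # assumed Bbv I recognition sequence

# The primer should match the reverse complement of the adaptor top strand.
primer_matches = reverse_complement(ADAPTOR_TOP).startswith(REVERSE_PRIMER)

# Position of the recognition sequence inside the (unmethylated) primer.
site_offset = REVERSE_PRIMER.find(BBV_I_SITE)

# Strand synthesized opposite the primer during PCR (methylated cytosines).
opposite_strand = reverse_complement(REVERSE_PRIMER)

# Idealized stripping with dATP only: trim from the 3' end back to the last A.
last_a = opposite_strand.rfind("A")
kept = opposite_strand[: last_a + 1]
removed = opposite_strand[last_a + 1:]

print(primer_matches, site_offset)
print(kept, removed)
```

In this model the removed stretch spans the complement of the recognition sequence, so refilling it with unmethylated nucleotides would leave both strands of the site free of methylated deoxycytosines.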

After PCR amplification, the tag-Dpn II fragments are isolated on avidinated beads, e.g. M-280 Dynabeads (Dynal, Oslo, Norway). After thorough washing, the 3′ strand in the region of the reverse primer is stripped back to the initial adenosine by treatment with T4 DNA polymerase in the presence of dATP. dTTP, dCTP, and dGTP are then added to the reaction to extend back the 3′ strand, thereby reconstituting the Bbv I site without methylated deoxycytosines.

After another thorough washing, the fragments bound to the beads are digested with Pac I, releasing the tag-fragment conjugates, and a stripping reaction is carried out to render the oligonucleotide tags single stranded. After the reaction is quenched, the tag-fragment conjugate is purified by phenol-chloroform extraction and combined with 5.5 μm GMA beads carrying tag complements, each tag complement having a 5′ phosphate. Hybridization is conducted under stringent conditions in the presence of a thermostable ligase so that only tags forming perfectly matched duplexes with their complements are ligated. The GMA beads are washed and the loaded beads are concentrated by FACS sorting, using the fluorescently labeled fragments to identify loaded GMA beads.

Separately from above, the following tag library is constructed for preparation of the tag-Tsp 509 I conjugates (SEQ ID NO: 7 and SEQ ID NO: 8):

Formula II

     Left Primer
5′- AGAATTCGGGCCTTAATTAA                 Kpn I
                                           ↓
5′- AGTCGACGGGCCTTAATTAA- [⁴(A,C,T)₈]-GGGTACCAAGTCAGAGTGAT
    TCAGCTGCCCGGAATTAATT- [⁴(T,G,A)₈]-CCCATGGTTCAGTCTCACTA
      ↑             ↑                 CCCATGGTTCAGTCTCACTA -5′
    SalI          Pac I                   Right Primer

This library is inserted into a pUC19 plasmid whose polylinker region is modified so that the upstream Eco RI site is destroyed and a new sequence of restriction sites Sal I-Kpn I-Eco RI-Apo I is inserted in place of the fragment between the Eco RI and Pst I sites of the unmodified pUC19. The modification is effected by digesting pUC19 with Eco RI and Pst I, isolating the larger fragment, and ligating the following adaptor (SEQ ID NO: 9) to the larger pUC19 fragment to form the modified pUC19:

5′-AATTTGTCGACATCTTCTCTTGGTACCGAATTCAAATTTCTGCA
       ACAGCTGTAGAAGAGAACCATGGCTTAAGTTTAAAG
           ↑               ↑     ↑     ↑
         SalI            KpnI  EcoRI  ApoI

The tag library of Formula II is digested with Sal I and Kpn I and inserted into the modified pUC19 using conventional protocols, after which the recombinants are transfected into a suitable host (e.g. preferably, dam⁻, Stratagene, La Jolla, Calif.) and expanded in culture. Tsp 509 I fragments, which have compatible ends with Eco RI-digested DNA, are readily inserted into the Eco RI site. The Apo I site provides a starting location for sequencing once the tag-fragment conjugates are loaded onto beads. The stripping reaction is not required in this case because the Apo I site does not contain methylated deoxycytosines and Apo I would not fortuitously cleave the fragment, since the fragment has already been digested to completion with Tsp 509 I. Modified pUC19 recombinants isolated from culture are digested with Eco RI and ligated to Tsp 509 I restriction fragments, after which the resulting recombinant products are again transfected into a host and expanded to form a library of tag-Tsp 509 I fragment conjugates. After isolation, a sample of about 10⁵ tag-Tsp 509 I fragment conjugates is obtained by serial dilution. The sample is re-transfected into fresh host bacteria and expanded in culture. From a standard miniprep of plasmid, the tag-Tsp 509 I fragment conjugates are amplified by PCR with 5-methyldeoxycytosine triphosphate substituted for deoxycytosine triphosphate. The following 19-mer forward and reverse primers (SEQ ID NO: 10 and SEQ ID NO: 11), specific for flanking sequences in pUC19, are used in the reaction:

forward primer: 5′-biotin-GTCGACGGGCCTTAATTAA

reverse primer: 5′-FAM-ACGTACGGACGTCTTTAAA

where “FAM” is as described above. After amplification, the tag-Tsp 509 I fragment conjugates are attached to beads as described above, except that rather than reconstituting an unmethylated Bbv I site for initiating sequencing, here the fragments need only be cleaved with Apo I to generate a 4-nucleotide protruding strand to which the first sequencing adaptor is ligated.

EXAMPLE 2 Signature Sequencing Phage λ Restriction Fragments with Encoded Adaptors

In this example the Dpn II and Tsp 509 I fragments loaded onto beads are sequenced, digested with Tsp 509 I and Dpn II, respectively, and sequenced again to generate ordered pairs of sequences for constructing a physical map. Fragments which fail to cleave carry encoded adaptors which must be inactivated prior to the start of the second round of sequencing; otherwise spurious ordered pairs of sequences are obtained. This may be accomplished in several ways. For example, a restriction site may be included between the type IIs nuclease recognition site and the protruding strand of the encoded adaptor, or the type IIs site of the encoded adaptor may be treated with a methylase prior to the second round of sequencing. For encoded adaptors listed below, the type IIs nuclease recognition site is preferably inactivated by treating the fragments with Alu I methylase.

Beads loaded with tag-fragment conjugates are placed in an instrument for MPSS sequencing. Either two separate instruments are required for analyzing the Dpn II fragments and Tsp 509 I fragments, or the analyses take place one after the other on the same machine, i.e. in this embodiment the loaded beads are not placed in the same chamber for sequencing. After loading and prior to sequencing, the FAM label is cleaved from the Dpn II fragments by Bbv I, which cleavage also leaves a protruding strand to which the first sequencing adaptor is ligated. Similarly, prior to sequencing, the FAM label is cleaved from the Tsp 509 I fragments by Apo I, which cleavage likewise leaves a protruding strand to which the first sequencing adaptor is ligated. In both cases, the first sequencing adaptor carries a Bbv I site disposed on the adaptor so that Bbv I recognizing the site cleaves the fragment to expose a protruding strand of unknown fragment sequence. The encoded adaptors of the set described below are applied to these protruding strands. Three cycles of ligation, identification, and cleavage are carried out at the end of each fragment initially and after digestion with either Dpn II or Tsp 509 I to give two 12-nucleotide ordered pairs of sequences for each fragment.

The top strands of the following 16 sets of 64 encoded adaptors (SEQ ID NO: 12 through SEQ ID NO: 27) are each separately synthesized on an automated DNA synthesizer (model 392, Applied Biosystems, Foster City) using standard methods. The bottom strand, which is the same for all adaptors, is synthesized separately then hybridized to the respective top strands:

SEQ ID NO.   Encoded Adaptor

12           5′-pANNNTACAGCTGCATCCCttggcgctgagg
                    pATGCACGCGTAGGG-5′
13           5′-pNANNTACAGCTGCATCCCtgggcctgtaag
                    pATGCACGCGTAGGG-5′
14           5′-pCNNNTACAGCTGCATCCCttgacgggtctc
                    pATGCACGCGTAGGG-5′
15           5′-pNCNNTACAGCTGCATCCCtgcccgcacagt
                    pATGCACGCGTAGGG-5′
16           5′-pGNNNTACAGCTGCATCCCttcgcctcggac
                    pATGCACGCGTAGGG-5′
17           5′-pNGNNTACAGCTGCATCCCtgatccgctagc
                    pATGCACGCGTAGGG-5′
18           5′-pTNNNTACAGCTGCATCCCttccgaacccgc
                    pATGCACGCGTAGGG-5′
19           5′-pNTNNTACAGCTGCATCCCtgagggggatag
                    pATGCACGCGTAGGG-5′
20           5′-pNNANTACAGCTGCATCCCttcccgctacac
                    pATGCACGCGTAGGG-5′
21           5′-pNNNATACAGCTGCATCCCtgactccccgag
                    pATGCACGCGTAGGG-5′
22           5′-pNNCNTACAGCTGCATCCCtgtgttgcgcgg
                    pATGCACGCGTAGGG-5′
23           5′-pNNNCTACAGCTGCATCCCtctacagcagcg
                    pATGCACGCGTAGGG-5′
24           5′-pNNGNTACAGCTGCATCCCtgtcgcgtcgtt
                    pATGCACGCGTAGGG-5′
25           5′-pNNNGTACAGCTGCATCCCtcggagcaacct
                    pATGCACGCGTAGGG-5′
26           5′-pNNTNTACAGCTGCATCCCtggtgaccgtag
                    pATGCACGCGTAGGG-5′
27           5′-pNNNTTACAGCTGCATCCCtcccctgtcgga
                    pATGCACGCGTAGGG-5′

where N and p are as defined above, and the nucleotides indicated in lower case letters are the 12-mer oligonucleotide tags. Each tag differs from every other by 6 nucleotides. Equal molar quantities of each adaptor are combined in NEB #2 restriction buffer (New England Biolabs, Beverly, Mass.) to form a mixture at a concentration of 1000 pmol/μL.
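The stated spacing between tags can be checked by counting mismatched positions for every pair of the 16 tags listed above. The following Python sketch is an illustration only; the helper names hamming and pairwise_distances are hypothetical, and the tag strings are copied from the lower case portions of SEQ ID NO: 12 through 27. It prints the closest pair and its distance rather than assuming the outcome.

```python
from itertools import combinations

# 12-mer tags copied from the lower case portions of SEQ ID NO: 12-27.
TAGS = {
    12: "ttggcgctgagg", 13: "tgggcctgtaag", 14: "ttgacgggtctc", 15: "tgcccgcacagt",
    16: "ttcgcctcggac", 17: "tgatccgctagc", 18: "ttccgaacccgc", 19: "tgagggggatag",
    20: "ttcccgctacac", 21: "tgactccccgag", 22: "tgtgttgcgcgg", 23: "tctacagcagcg",
    24: "tgtcgcgtcgtt", 25: "tcggagcaacct", 26: "tggtgaccgtag", 27: "tcccctgtcgga",
}


def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_distances(tags):
    """Map each pair of SEQ ID numbers to the mismatch count of their tags."""
    return {
        (i, j): hamming(tags[i], tags[j])
        for i, j in combinations(sorted(tags), 2)
    }


distances = pairwise_distances(TAGS)
closest_pair = min(distances, key=distances.get)
print("closest pair:", closest_pair, "mismatches:", distances[closest_pair])
```

A larger minimum mismatch count means a tag complement is less likely to form a stable duplex with the wrong tag under the stringent wash.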

Each of the 16 tag complements is separately synthesized as an amino-derivatized oligonucleotide and is labeled with a fluorescein molecule (using an NHS-ester of fluorescein, available from Molecular Probes, Eugene, Oreg.) which is attached to the 5′ end of the tag complement through a polyethylene glycol linker (Clontech Laboratories, Palo Alto, Calif.). The sequences of the tag complements are simply the 12-mer complements of the tags listed above.

Ligation of the adaptors to the target polynucleotide is carried out in a mixture consisting of 5 μL beads (20 mg), 3 μL NEB 10× ligase buffer, 5 μL adaptor mix (25 nM), 2.5 μL NEB T4 DNA ligase (2000 units/μL), and 14.5 μL distilled water. The mixture is incubated at 16° C. for 30 minutes, after which the beads are washed 3 times in TE (pH 8.0).
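As a quick arithmetic check on the ligation mixture above, the component volumes can be summed to confirm that the 10× ligase buffer ends up at 1×. The short Python sketch below uses only the volumes and the ligase stock activity stated in the text; the variable names are illustrative.

```python
# Component volumes in microliters, as listed for the ligation mixture.
components_uL = {
    "beads": 5.0,
    "10x ligase buffer": 3.0,
    "adaptor mix": 5.0,
    "T4 DNA ligase": 2.5,
    "distilled water": 14.5,
}

total_uL = sum(components_uL.values())

# Final strength of the ligase buffer after dilution into the full volume.
buffer_fold = 10 * components_uL["10x ligase buffer"] / total_uL

# Total ligase activity added, from the stated stock of 2000 units per microliter.
ligase_units = components_uL["T4 DNA ligase"] * 2000

print(f"total volume: {total_uL} uL")
print(f"buffer strength: {buffer_fold}x")
print(f"ligase: {ligase_units} units ({ligase_units / total_uL:.1f} units/uL)")
```

The volumes sum to 30 μL, so the 3 μL of 10× buffer is diluted tenfold.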

After centrifugation and removal of TE, the 3′ phosphates of the ligated adaptors are removed by treating the polynucleotide-bead mixture with calf intestinal alkaline phosphatase (CIP) (New England Biolabs, Beverly, Mass.), using the manufacturer's protocol. After removal of the 3′ phosphates, the CIP may be inactivated by proteolytic digestion, e.g. using Pronase™ (available from Boehringer Mannheim, Indianapolis, Ind.), or an equivalent protease, with the manufacturer's protocol. The polynucleotide-bead mixture is then washed, treated with a mixture of T4 polynucleotide kinase and T4 DNA ligase (New England Biolabs, Beverly, Mass.) to add a 5′ phosphate at the gap between the target polynucleotide and the adaptor, and to complete the ligation of the adaptors to the target polynucleotide. The bead-polynucleotide mixture is then washed in TE.

Separately, each of the labeled tag complements is applied to the polynucleotide-bead mixture under conditions which permit the formation of perfectly matched duplexes only between the oligonucleotide tags and their respective complements, after which the mixture is washed under stringent conditions, and the presence or absence of a fluorescent signal is measured. Tag complements are applied in a solution consisting of 25 nM tag complement, 50 mM NaCl, 3 mM Mg, 10 mM Tris-HCl (pH 8.5), at 20° C., incubated for 10 minutes, then washed in the same solution (without tag complement) for 10 minutes at 55° C.
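The presence or absence of signal for each of the 16 tag complements can be turned into a four-nucleotide call, because each adaptor set fixes one nucleotide at one position of its protruding strand. The Python sketch below is illustrative only and rests on assumptions not stated in this passage: the first four positions of each listed top strand are taken as the protruding strand read 5′ to 3′, exactly one set per position is assumed to give a signal, and the fragment strand is obtained by ordinary Watson-Crick reverse complementation. The names ADAPTOR_SETS, decode_cycle, and fragment_overhang are hypothetical.

```python
# SEQ ID NO -> (0-based position in the adaptor's 4-nt protruding strand, nucleotide),
# read off the patterns ANNN, NANN, CNNN, ... in the adaptor table.
ADAPTOR_SETS = {
    12: (0, "A"), 13: (1, "A"), 14: (0, "C"), 15: (1, "C"),
    16: (0, "G"), 17: (1, "G"), 18: (0, "T"), 19: (1, "T"),
    20: (2, "A"), 21: (3, "A"), 22: (2, "C"), 23: (3, "C"),
    24: (2, "G"), 25: (3, "G"), 26: (2, "T"), 27: (3, "T"),
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def decode_cycle(positive_sets):
    """Return the adaptor protruding strand (5'->3') implied by the positive sets."""
    calls = [None] * 4
    for seq_id in positive_sets:
        position, base = ADAPTOR_SETS[seq_id]
        if calls[position] is not None:
            raise ValueError(f"conflicting signals at position {position + 1}")
        calls[position] = base
    if None in calls:
        raise ValueError("no signal for at least one position")
    return "".join(calls)


def fragment_overhang(adaptor_overhang):
    """Reverse complement: the fragment strand pairing with the adaptor overhang."""
    return adaptor_overhang.translate(COMPLEMENT)[::-1]


# Hypothetical readout: signals from the sets of SEQ ID NO: 13, 16, 20 and 27.
adaptor_call = decode_cycle({13, 16, 20, 27})
print(adaptor_call, fragment_overhang(adaptor_call))
```

Rejecting missing or conflicting signals, rather than guessing, keeps an ambiguous bead from contributing a wrong end sequence to the map.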

After the four nucleotides are identified as described above, the encoded adaptors are cleaved from the polynucleotides with Bbv I using the manufacturer's protocol. After an initial ligation and identification, the cycle of ligation, identification, and cleavage is repeated three times to give the sequence of the 16 terminal nucleotides of the target polynucleotide.

A flow chamber (500), diagrammatically represented in FIG. 3, is prepared by etching a cavity having a fluid inlet (502) and outlet (504) in a glass plate (506) using standard micromachining techniques, e.g. Ekstrom et al., PCT Pubn. No. WO 91/16966; Brown, U.S. Pat. No. 4,911,782; Harrison et al., Anal. Chem. 64: 1926-1932 (1992); and the like. The dimensions of flow chamber (500) are such that loaded microparticles (508), e.g. GMA beads, may be disposed in cavity (510) in a closely packed planar monolayer of 100-200 thousand beads. Cavity (510) is made into a closed chamber with inlet and outlet by anodic bonding of a glass cover slip (512) onto the etched glass plate (506), e.g. Pomerantz, U.S. Pat. No. 3,397,279. Reagents are metered into the flow chamber from syringe pumps (514 through 520) through valve block (522) controlled by a microprocessor as is commonly used on automated DNA and peptide synthesizers, e.g. Bridgham et al., U.S. Pat. No. 4,668,479; Hood et al., U.S. Pat. No. 4,252,769; Barstow et al., U.S. Pat. No. 5,203,368; Hunkapiller, U.S. Pat. No. 4,703,913; or the like.

Three cycles of ligation, identification, and cleavage are carried out in flow chamber (500) to give the sequences of 12 nucleotides at the termini of each of approximately 100,000 fragments, after which the fragments are cleaved with either Dpn II or Tsp 509 I and sequenced again. Nucleotides of the fragments are identified by hybridizing tag complements to the encoded adaptors as described above. Specifically hybridized tag complements are detected by exciting their fluorescent labels with illumination beam (524) from light source (526), which may be a laser, mercury arc lamp, or the like. Illumination beam (524) passes through filter (528) and excites the fluorescent labels on tag complements specifically hybridized to encoded adaptors in flow chamber (500). Resulting fluorescence (530) is collected by confocal microscope (532), passed through filter (534), and directed to CCD camera (536), which creates an electronic image of the bead array for processing and analysis by workstation (538). Preferably, after each ligation and cleavage step, the cDNAs are treated with Pronase™ or like enzyme. Encoded adaptors and T4 DNA ligase (Promega, Madison, Wis.) at about 0.75 units per μL are passed through the flow chamber at a flow rate of about 1-2 μL per minute for about 20-30 minutes at 16° C., after which 3′ phosphates are removed from the adaptors and the cDNAs prepared for second strand ligation by passing a mixture of alkaline phosphatase (New England Bioscience, Beverly, Mass.) at 0.02 units per μL and T4 DNA kinase (New England Bioscience, Beverly, Mass.) at 7 units per μL through the flow chamber at 37° C. with a flow rate of 1-2 μL per minute for 15-20 minutes. Ligation is accomplished by passing T4 DNA ligase (0.75 units per mL, Promega) through the flow chamber for 20-30 minutes. Tag complements at 25 nM concentration are passed through the flow chamber at a flow rate of 1-2 μL per minute for 10 minutes at 20° C., after which fluorescent labels carried by the tag complements are illuminated and fluorescence is collected. The tag complements are melted from the encoded adaptors by passing hybridization buffer through the flow chamber at a flow rate of 1-2 μL per minute at 55° C. for 10 minutes. Encoded adaptors are cleaved from the cDNAs by passing Bbv I (New England Biosciences, Beverly, Mass.) at 1 unit/μL at a flow rate of 1-2 μL per minute for 20 minutes at 37° C.

After the ordered pairs of sequences have been collected, a physical map of phage λ is constructed by matching overlapping sequences of the ordered pairs.
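The matching step can be pictured as chaining end sequences that recur in more than one ordered pair. The following Python sketch is only an illustration under simplifying assumptions that are not part of the description: end sequences are replaced by hypothetical labels (r1, q1, and so on), each label is assumed to be unique, reads are assumed error-free, and the pairs are assumed to link into a single unbranched chain. The function name order_signatures and the toy data are hypothetical.

```python
from collections import defaultdict


def order_signatures(pairs):
    """Chain end-sequence labels into one linear order via shared pair members."""
    neighbors = defaultdict(set)
    for first, second in pairs:
        neighbors[first].add(second)
        neighbors[second].add(first)

    # In an unbranched chain, exactly two labels have a single neighbor.
    ends = sorted(label for label, adj in neighbors.items() if len(adj) == 1)
    if len(ends) != 2 or any(len(adj) > 2 for adj in neighbors.values()):
        raise ValueError("pairs do not form a single unbranched chain")

    order = [ends[0]]
    previous = None
    while True:
        candidates = [n for n in neighbors[order[-1]] if n != previous]
        if not candidates:
            break
        previous = order[-1]
        order.append(candidates[0])

    if len(order) != len(neighbors):
        raise ValueError("some pairs are not connected to the chain")
    return order


# Toy ordered pairs (hypothetical labels standing in for 12-nucleotide reads).
first_population = [("r2", "q1"), ("r1", "q1"), ("r3", "q2"), ("r2", "q2")]
second_population = [("q2", "r2"), ("q1", "r2")]

order = order_signatures(first_population + second_population)
print(order)
```

For the toy pairs the chain runs r1, q1, r2, q2, r3. Real data would also need handling of repeated end sequences, strand orientation, and sequencing errors, none of which this sketch attempts.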

SEQUENCE LISTING

Number of sequences: 27

SEQ ID NO  Length (nucleotides)  Type          Strandedness  Topology  Sequence
1          20                    nucleic acid  double        linear    AGAATTCGGG CCTTAATTAA
2          20                    nucleic acid  double        linear    GGGTACCAAG TCAGAGTGAT
3          18                    nucleic acid  double        linear    GATCCTTAAT TAAGAGCT
4          22                    nucleic acid  double        linear    CTAGAAGCTG CGCTTGCTTT TG
5          19                    nucleic acid  double        linear    GAATTCGGGC CTTAATTAA
6          19                    nucleic acid  double        linear    CAAAAGCAAG CGCAGCTTC
7          20                    nucleic acid  single        linear    AGAATTCGGG CCTTAATTAA
8          20                    nucleic acid  single        linear    ATCACTCTGA CTTGGTACCC
9          44                    nucleic acid  double        linear    AATTTGTCGA CATCTTCTCT TGGTACCGAA TTCAAATTTC TGCA
10         19                    nucleic acid  double        linear    GTCGACGGGC CTTAATTAA
11         19                    nucleic acid  double        linear    ACGTACGGAC GTCTTTAAA
12         30                    nucleic acid  double        linear    ANNNTACAGC TGCATCCCTT GGCGCTGAGG
13         30                    nucleic acid  double        linear    NANNTACAGC TGCATCCCTG GGCCTGTAAG
14         30                    nucleic acid  double        linear    CNNNTACAGC TGCATCCCTT GACGGGTCTC
15         30                    nucleic acid  double        linear    NCNNTACAGC TGCATCCCTG CCCGCACAGT
16         30                    nucleic acid  double        linear    GNNNTACAGC TGCATCCCTT CGCCTCGGAC
17         30                    nucleic acid  double        linear    NGNNTACAGC TGCATCCCTG ATCCGCTAGC
18         30                    nucleic acid  double        linear    TNNNTACAGC TGCATCCCTT CCGAACCCGC
19         30                    nucleic acid  double        linear    NTNNTACAGC TGCATCCCTG AGGGGGATAG
20         30                    nucleic acid  double        linear    NNANTACAGC TGCATCCCTT CCCGCTACAC
21         30                    nucleic acid  double        linear    NNNATACAGC TGCATCCCTG ACTCCCCGAG
22         30                    nucleic acid  double        linear    NNCNTACAGC TGCATCCCTG TGTTGCGCGG
23         30                    nucleic acid  double        linear    NNNCTACAGC TGCATCCCTC TACAGCAGCG
24         30                    nucleic acid  double        linear    NNGNTACAGC TGCATCCCTG TCGCGTCGTT
25         30                    nucleic acid  double        linear    NNNGTACAGC TGCATCCCTC GGAGCAACCT
26         30                    nucleic acid  double        linear    NNTNTACAGC TGCATCCCTG GTGACCGTAG
27         30                    nucleic acid  double        linear    NNNTTACAGC TGCATCCCTC CCCTGTCGGA

What is claimed is:
1. A method of ordering restriction fragments of a target polynucleotide, the method comprising the steps of:

(a) (i) producing a first population of restriction fragments by digestion of the target polynucleotide with a first restriction endonuclease having a first recognition site,

(ii) attaching each restriction fragment of the first population to a solid phase support by one end, such that copies of each restriction fragment are attached to spatially discrete regions of one or more solid phase supports;

(iii) determining the nucleotide sequence of a portion of each free end of each restriction fragment of the first population;

(iv) digesting each restriction fragment of the first population with a second restriction endonuclease to form a first set of truncated support-bound restriction fragments, the second restriction endonuclease having a second recognition site different from that of the first restriction endonuclease, wherein at least one of said restriction endonucleases has a 4-basepair recognition sequence;

(v) determining the nucleotide sequence of a portion of each free end of each truncated restriction fragment of the first set, so that an ordered pair of sequences is obtained for each restriction fragment of the first population;

(b) (i) producing a second population of restriction fragments by digestion of the target polynucleotide with the second restriction endonuclease,

(ii) attaching each restriction fragment of the second population to a solid phase support by one end, such that copies of each restriction fragment are attached to spatially discrete regions of one or more solid phase supports;

(iii) determining the nucleotide sequence of a portion of each free end of each restriction fragment of the second population;

(iv) digesting each restriction fragment of the second population with the first restriction endonuclease to form a second set of truncated support-bound restriction fragments;

(v) determining the nucleotide sequence of a portion of each free end of each truncated restriction fragment of the second set, so that an ordered pair of sequences is obtained for each restriction fragment of the second population; and

(c) ordering the restriction fragments produced by the first and second restriction endonucleases by aligning the matching nucleotide sequences from the ordered pairs of sequences from the first and second populations of restriction fragments;

wherein each said step of attaching includes:

attaching an oligonucleotide tag from a repertoire of tags to each restriction fragment, such that each oligonucleotide tag from the repertoire is selected from the same minimally cross-hybridizing set of oligonucleotides; wherein each of said oligonucleotide tags differs from every other oligonucleotide tag of said minimally cross-hybridizing set by at least three nucleotides;

sampling said first population of restriction fragments such that substantially all different restriction fragments in said first population have different oligonucleotide tags attached; and

specifically hybridizing the oligonucleotide tags with their respective tag complements, which are attached to spatially discrete regions on one or more solid phase supports;

and wherein each step of determining includes:

ligating to the free end of each support-bound restriction fragment, an encoded adaptor having a protruding strand which forms a perfectly matched duplex with said free end, said encoded adaptor further comprising an oligonucleotide tag selected from a minimally cross-hybridizing set of oligonucleotides, and a nuclease recognition site of a nuclease whose cleavage site is separate from its recognition site;

specifically hybridizing a labeled tag complement to said oligonucleotide tag of the encoded adaptor,

identifying the type and sequence of nucleotides in the free end of the restriction fragment in accordance with the label carried by the tag complement;

cleaving the fragment with a nuclease recognizing the nuclease recognition site of the encoded adaptor, such that the fragment is shortened by one or more nucleotides; and

repeating said ligating, hybridizing of labeled tag complements, and identifying steps, until a desired length of the nucleotide sequence of the end of the fragment is determined.
2. The method of claim 1, wherein said nucleotide sequences are at least 12 nucleotides in length.

3. The method of claim 1, wherein said target polynucleotide is between 30 and 100 kilobases in length.

4. The method of claim 1, wherein said oligonucleotide tags are single stranded.

5. The method of claim 4, wherein said tag complements are single stranded.

6. The method of claim 1, wherein each said oligonucleotide tag consists of a plurality of subunits, each subunit consisting of an oligonucleotide of 3 to 9 nucleotides in length and each subunit being selected from the same minimally cross-hybridizing set of oligonucleotides.

7. The method of claim 1, wherein said repertoire of said oligonucleotide tags contains at least 1000 of said oligonucleotide tags.

8. The method of claim 1, wherein each of said oligonucleotide tags has a length in the range of from 12 to 60 nucleotides.

9. The method of claim 1, wherein each said spatially discrete region is a microparticle.

10. The method of claim 1, wherein said repertoire contains at least 10,000 of said oligonucleotide tags.

11. The method of claim 1, wherein at least one of said one or more solid phase supports is a planar substrate having a plurality of spatially discrete surface regions.

12. The method of claim 9, wherein each said microparticle has a diameter in the range of from 5 to 40 μm.

13. The method of claim 11, wherein each of said spatially discrete surface regions has an area in the range of from 10 to 1000 μm².